MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive
Abstract
Motivation: The NCBI's Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA.
Results: We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.
Availability and implementation: The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline.
Contact: [email protected].
Supplementary information: Supplementary data are available at Bioinformatics online.
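As a rough illustration of two of the schema elements named in the Results (ontology-term mapping and real-valued property extraction), the following toy sketch normalizes one raw key-value metadata record. It is not the MetaSRA pipeline: the synonym table, the `TOY:` term identifiers, the `NormalizedSample` class and the `normalize` function are all hypothetical names invented for this example, and sample-type labeling is omitted.

```python
import re
from dataclasses import dataclass, field

# Hypothetical synonym table. The "TOY:" identifiers are placeholders,
# not identifiers from any real biomedical ontology.
TOY_ONTOLOGY = {
    "liver": "TOY:0001",
    "hepatic tissue": "TOY:0001",
    "breast cancer": "TOY:0002",
    "breast carcinoma": "TOY:0002",
}

# Matches values such as "45 years" or "2.5 mg": a number followed by a unit.
NUMBER_WITH_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


@dataclass
class NormalizedSample:
    """Toy normalized record: ontology terms plus real-valued properties."""
    ontology_terms: set = field(default_factory=set)
    real_valued: dict = field(default_factory=dict)


def normalize(raw: dict) -> NormalizedSample:
    """Map free-text values to toy ontology terms and extract numeric values."""
    sample = NormalizedSample()
    for key, value in raw.items():
        text = value.strip().lower()
        if text in TOY_ONTOLOGY:
            # Synonyms and case variants collapse onto one term identifier.
            sample.ontology_terms.add(TOY_ONTOLOGY[text])
            continue
        match = NUMBER_WITH_UNIT.match(value)
        if match:
            # Store (numeric value, unit) under the lower-cased key.
            number, unit = float(match.group(1)), match.group(2).lower()
            sample.real_valued[key.strip().lower()] = (number, unit)
    return sample


record = normalize(
    {"tissue": "Hepatic Tissue", "disease": "breast carcinoma", "Age": "45 years"}
)
print(sorted(record.ontology_terms), record.real_valued)
```

An exact-match lookup like this only handles synonyms listed in the table; the spelling variants and references to outside sources mentioned in the Motivation would need more than a dictionary lookup.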
Similar resources
The sequence read archive: explosive growth of sequencing data
New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key...
Molecular Typing by Polymerase Chain Reaction Sequence Specific Primers (PCR-SSP) of Human Leukocyte Class I and Class II Alleles in a Sample of Iraqi Visceral Leishmaniasis Patients
Objective: This study aimed to investigate the association between HLA alleles and visceral leishmaniasis (VL) in a sample of Iraqi patients. Methods: A total of 30 patients were studied, in addition to 20 age, gender and ethnicity matched controls. All subjects were genotyped by polymerase chain reaction-sequence specific primers (PCR-SSP) method. Results: For HLA-class I region (A and B loci)...
Data and Methods for the Production of National Population Estimates: An Overview and Analysis of Available Metadata
Thomas Spoorenberg Translated by: Elham Fathi Statistical Center of Iran Abstract. Official population estimates can be produced using a variety of data sources and methods. These range from the direct extraction of information from continuously updated population registers to procedures for updating the status of a population enumerated previously in a periodic census. Additional sources and ...
Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive
It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories prevents users from choosing a suitable data set from among the large number of search results. Repository users need to be able to set a threshold to reduce the ...
Journal title:
Volume 33, Issue
Pages -
Publication date: 2017